Reset Prometheus request tags on conf reload #2728

Closed

Conversation

elafarge
Contributor

@elafarge elafarge commented Jul 2, 2018

What this PR does:
On every backend configuration reload, Prometheus collectors for
request metrics (bearing the upstream_ip tag) are recreated, so that
metrics for dead upstreams are no longer exposed on the /metrics
endpoint.

We also changed the structure of the collector.go file into a
singleton, to make it possible for the ingress controller to ask for a
reset of these Prometheus collectors on configuration reloads.
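
Below is a minimal, hypothetical sketch of the idea (package layout, metric names and label sets are illustrative, not the actual code in collector.go): a package-level singleton owns the upstream_ip-labelled vectors and exposes a reset function the controller can call on every configuration reload. Here the vectors are simply cleared with Reset() rather than unregistered and recreated, which achieves the same flushing effect.

```go
package collector

import (
	"sync"

	"github.com/prometheus/client_golang/prometheus"
)

// requestCollector holds the per-request metric vectors carrying the
// upstream_ip label. A single shared instance lets the ingress
// controller ask for a flush whenever the backend configuration is
// reloaded.
type requestCollector struct {
	mu       sync.Mutex
	requests *prometheus.CounterVec
}

var instance = newRequestCollector()

func newRequestCollector() *requestCollector {
	c := &requestCollector{
		requests: prometheus.NewCounterVec(
			prometheus.CounterOpts{
				// Hypothetical metric name, for illustration only.
				Name: "nginx_upstream_requests_total",
				Help: "Requests proxied to each upstream.",
			},
			[]string{"upstream_ip", "status"},
		),
	}
	prometheus.MustRegister(c.requests)
	return c
}

// ResetRequestMetrics drops every timeseries currently held by the
// upstream_ip-labelled vectors, so stale upstream IPs stop showing up
// on /metrics after a configuration reload.
func ResetRequestMetrics() {
	instance.mu.Lock()
	defer instance.mu.Unlock()
	instance.requests.Reset()
}
```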

Why we need it:

When an upstream pod dies (because of a scale-down operation, a rolling
upgrade...), metrics associated with this pod via the upstream_ip
tag remain exposed on the /metrics endpoint. This causes three
undesirable effects:

  • Analyzing these metrics in Prometheus/Grafana can become impossible:
    queries easily tend to time out, or frontends tend to freeze
    because too many timeseries are returned
  • Payloads periodically fetched by Prometheus can get very big over
    time on production clusters, which also makes them more expensive to
    ingest and store
  • Since these metrics stay in memory, we end up with a (slow)
    memory leak.

Which issue this PR fixes:
I didn't open an issue for this and filed a PR directly, but I can open one if required.

Special notes for your reviewer:
If the "singleton" design pattern bothers you (because you intend to instantiate multiple collectors in the future, for instance), or if anything else does, I'll be happy to promptly refactor the PR.

What could we break:
On infrastructure where the config is reloaded at a high pace, we introduce discontinuities in our timeseries. That is something the Prometheus rate() function (and similar ones) handles very well. However, if the backend reconfiguration frequency is higher than the Prometheus scrape_interval, it's possible that we get incoherent results.
If so, we may just rate-limit the metric flushing operation to one every "user-configurable amount of time" (e.g. 5 minutes). As long as this interval spans a couple of Prometheus scrape_intervals, we should be totally fine :)
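
A rough sketch of what such rate-limiting could look like, reusing the hypothetical ResetRequestMetrics helper from the sketch above (minPeriod stands in for the user-configurable interval):

```go
package collector

import (
	"sync"
	"time"
)

// flushLimiter rate-limits the metric reset so that frequent backend
// reloads cannot flush the timeseries more often than once per
// user-configurable interval (e.g. 5 minutes).
type flushLimiter struct {
	mu        sync.Mutex
	minPeriod time.Duration
	lastFlush time.Time
}

// MaybeReset flushes the request metrics only if minPeriod has elapsed
// since the previous flush; otherwise it is a no-op.
func (l *flushLimiter) MaybeReset() {
	l.mu.Lock()
	defer l.mu.Unlock()
	if time.Since(l.lastFlush) < l.minPeriod {
		return
	}
	ResetRequestMetrics()
	l.lastFlush = time.Now()
}
```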

PR Status:
The PR compiles and is being tested on our staging infrastructure. So far so good :)
I'll try it on our production clusters today, compare the size of the /metrics payloads before and after the upgrade, and also check for discontinuities during intense scale-in operations.

TL;DR
-----
- makes analysing queries (in particular, over the past few hours) in
  Prometheus/Grafana much lighter
- on every backend configuration reload, Prometheus metrics about
  request values are flushed
- fixes a (reasonable) memory leak

Description of the issue
------------------------
When an upstream pod dies (because of a scale-down operation, a rolling
upgrade...), metrics associated with this pod via the upstream_ip
tag remain exposed on the /metrics endpoint. This causes three
undesirable effects:
- Analyzing these metrics in Prometheus/Grafana can become impossible:
  queries easily tend to time out, or frontends tend to freeze
  because too many timeseries are returned
- Payloads periodically fetched by Prometheus can get very big over
  time on production clusters, which also makes them more expensive to
  ingest and store
- Since these metrics stay in memory, we end up with a (slow)
  memory leak.

What this commit does
---------------------
On every backend configuration reload, Prometheus collectors for
request metrics (bearing the upstream_ip tag) are recreated.

We also changed the structure of the collector.go file into a
singleton (to make it possible for the ingress controller to ask for a
reset of these Prometheus collectors on configuration reloads).
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: elafarge
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: justinsb

Assign the PR to them by writing /assign @justinsb in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jul 2, 2018
@codecov-io

Codecov Report

Merging #2728 into master will decrease coverage by 0.05%.
The diff coverage is 0%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master   #2728      +/-   ##
=========================================
- Coverage   40.96%   40.9%   -0.06%     
=========================================
  Files          72      72              
  Lines        5085    5099      +14     
=========================================
+ Hits         2083    2086       +3     
- Misses       2713    2725      +12     
+ Partials      289     288       -1
Impacted Files Coverage Δ
cmd/nginx/main.go 24.81% <0%> (+2.18%) ⬆️
internal/ingress/metric/collector/collector.go 2.85% <0%> (-0.23%) ⬇️
internal/ingress/controller/controller.go 2.39% <0%> (-0.01%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 92474ed...113597e. Read the comment docs.

@aledbf
Member

aledbf commented Jul 2, 2018

Closing. This is being done in #2726

@aledbf aledbf closed this Jul 2, 2018